Ryegrass, a potential source of gluten-like proteins

Sophia Escobar-Correas

CSIRO Agriculture & Food

Introduction

Hello! My name is Sophia. I am a molecular biologist working in proteomics, currently a postdoctoral fellow. Before Data School, my only coding experience was writing Excel macros. I used to spend a lot of time cleaning and tidying protein data, and I always felt I could do it faster if I had programming skills. These weeks of learning R have changed my daily work, and seeing what is possible has given me a new perspective on my research.

My Project

I am studying ryegrass, a potential source of gluten peptide contamination. Gluten refers to a class of storage proteins found in cereal grains, including wheat, rye, barley, and oats. In people with coeliac disease, consuming these proteins triggers an autoimmune response. Previous studies have identified gluten-like proteins in ryegrass. Since ryegrass is a common weed in grain fields, there is a possibility of cross-contamination. First, I need to characterise the gluten proteins and peptides of ryegrass origin. To do so, I performed data-dependent mass spectrometry analysis, which identified 3162 proteins and 8231 peptides. What I need to do now is find out how many of them are gluten-like.

Preliminary results

I will analyse the amino acid composition of the proteins in my database. Since gluten proteins are rich in the amino acids glutamine (Q) and proline (P), I will search for all proteins that contain over 20% glutamine.
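The 20% glutamine screen can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper name `aa_percent` and the two toy sequences are hypothetical and do not come from the ryegrass database, and the actual analysis may have been done with Biostrings and tidyverse functions instead.

```r
# Percentage of one amino acid (one-letter code) in a protein sequence.
# aa_percent is a hypothetical helper name used for illustration only.
aa_percent <- function(sequence, aa) {
  residues <- strsplit(sequence, "")[[1]]
  100 * sum(residues == aa) / length(residues)
}

# Hypothetical toy sequences, NOT taken from the ryegrass database.
toy <- c(
  glutamine_rich = "QQPQQPFPQQPQQLQQ",
  glutamine_poor = "MASAELSREENVYMAK"
)

# Glutamine percentage for every sequence (names are kept).
q_percent <- sapply(toy, aa_percent, aa = "Q")

# Keep only the sequences above the 20% glutamine threshold.
gluten_like_candidates <- names(q_percent)[q_percent > 20]

print(round(q_percent, 2))
print(gluten_like_candidates)
```

The threshold is kept as a plain comparison so it is easy to change or to combine with a second condition on proline.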

Tables
Table 1: Protein Database
N Accession Name Sequence
1 spP4910614331_MAIZE 14-3-3-like MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEEGRGNEDRVTLIKDYRGKIETELTKICDGILKLLETHLVPSSTAPESKVFYLKMKGDYYRYLAEFKTGAERKDAAENTMVAYKAAQDIALAELAPTHPIRLGLALNFSVFYYEILNSPDRACSLAKQAFDEAISELDTLSEESYKDSTLIMQLLRDNLTLWTSDISEDPAEEIREAPKRDSSEGQ
2 spQ84Q72HS181_ORYSJ 18.1 MSLIRRSNVFDPFSLDLWDPFDGFPFGSGSRSSGSIFPSFPRGTSSETAAFAGARIDWKETPEAHVFKADVPGLKKEEVKVEVEDGNVLQISGERSKEQEEKTDKWHRVERSSGKFLRRFRLPENTKPEQIKASMENGVLTVTVPKEEPKKPDVKSIQVTG
3 spP69555PSBH_WHEAT Photosystem MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN
4 spP36886PSAK_HORVU Photosystem MASQLSAMTSVPQFHGLRTYSSPRSMATLPSLRRRRSQGIRCDYIGSSTNLIMVTTTTLMLFAGRFGLAPSANRKATAGLKLEARESGLQTGDPAGFTLADTLACGAVGHIMGVGIVLGLKNTGVLDQIIG
5 spQ6YZE2GSA_ORYSJ Glutamate-1-semialdehyde MAGAAAASAAAAAVASGISARPVAPRPSPSRARAPRSVVRAAISVEKGEKAYTVEKSEEIFNAAKELMPGGVNSPVRAFKSVGGQPIVFDSVKGSRMWDVDGNEYIDYVGSWGPAIIGHADDTVNAALIETLKKGTSFGAPCVLENVLAEMVISAVPSIEMVRFVNSGTEACMGALRLVRAFTGREKILKFEGCYHGHADSFLVKAGSGVATLGLPDSPGVPKGATSETLTAPYNDVEAVKKLFEENKGQIAAVFLEPVVGNAGFIPPQPGFLNALRDLTKQDGALLVFDEVMTGFRLAYGGAQEYFGITPDVSTLGKIIGGGLPVGAYGGRKDIMEMVAPAGPMYQAGTLSGNPLAMTAGIHTLKRLMEPGTYDYLDKITGDLVRGVLDAGAKTGHEMCGGHIRGMFGFFFTAGPVHNFGDAKKSDTAKFGRFYRGMLEEGVYLAPSQFEAGFTSLAHTSQDIEKTVEAAAKVLRRI
Note: Five example proteins from the database. The Sequence column gives the amino acids (one-letter code) that make up each protein.

Look for amino acids Q and P.

Table 2: Amino acid composition
N  Accession            Name                      totalAA  Qcomp  Q100  Pcomp  P100
1  spP4910614331_MAIZE  14-3-3-like               261      6      2.30  7      2.68
2  spQ84Q72HS181_ORYSJ  18.1                      161      4      2.48  12     7.45
3  spP69555PSBH_WHEAT   Photosystem               73       1      1.37  5      6.85
4  spP36886PSAK_HORVU   Photosystem               131      5      3.82  5      3.82
5  spQ6YZE2GSA_ORYSJ    Glutamate-1-semialdehyde  478      8      1.67  27     5.65
Note: totalAA = total number of amino acids in the protein
Qcomp = number of glutamine residues in the protein
Q100 = percentage of glutamine in the protein
Pcomp = number of proline residues in the protein
P100 = percentage of proline in the protein
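As a rough sketch of how the columns of Table 2 can be derived from the Sequence column of Table 1, the base R snippet below recomputes them for two of the example proteins. The helper `count_aa` is a hypothetical name, and base R is used here only so the example runs without extra packages; it is not necessarily the code used for the real table.

```r
# Two of the example proteins from Table 1 (accession and sequence copied as shown there).
proteins <- data.frame(
  Accession = c("spP69555PSBH_WHEAT", "spP36886PSAK_HORVU"),
  Sequence = c(
    "MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN",
    "MASQLSAMTSVPQFHGLRTYSSPRSMATLPSLRRRRSQGIRCDYIGSSTNLIMVTTTTLMLFAGRFGLAPSANRKATAGLKLEARESGLQTGDPAGFTLADTLACGAVGHIMGVGIVLGLKNTGVLDQIIG"
  ),
  stringsAsFactors = FALSE
)

# Count occurrences of one amino acid letter in each sequence.
# count_aa is a hypothetical helper name used for illustration only.
count_aa <- function(sequences, aa) {
  lengths(regmatches(sequences, gregexpr(aa, sequences, fixed = TRUE)))
}

# Same column names as Table 2.
proteins$totalAA <- nchar(proteins$Sequence)
proteins$Qcomp <- count_aa(proteins$Sequence, "Q")
proteins$Q100 <- round(100 * proteins$Qcomp / proteins$totalAA, 2)
proteins$Pcomp <- count_aa(proteins$Sequence, "P")
proteins$P100 <- round(100 * proteins$Pcomp / proteins$totalAA, 2)

print(proteins[, c("Accession", "totalAA", "Qcomp", "Q100", "Pcomp", "P100")])
```

For these two sequences the computed values agree with rows 3 and 4 of Table 2.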

Working with Protein Data

Plotting

Figure 1: Glutamine and Proline composition in Ryegrass

My Digital Toolbox

To work with protein databases, which are usually in .fasta format, I used the Biostrings package. For tidying the data, I used tidyverse (my new best friend) and plyr. Other packages I used are dplyr and stringr. For visualisation, I used ggplot2 and gganimate.
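To show what reading a .fasta file involves, here is a minimal base R sketch. It is a simplified stand-in, not the Biostrings workflow itself (Biostrings provides `readAAStringSet()` for this job). The function name `read_fasta_simple`, the temporary file and the two toy records are all hypothetical.

```r
# Minimal FASTA reader: returns a data frame with one row per record.
# read_fasta_simple is a hypothetical name; this is not a Biostrings function.
read_fasta_simple <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop blank lines
  is_header <- startsWith(lines, ">")      # header lines start with ">"
  record_id <- cumsum(is_header)           # record index for every line
  headers <- sub("^>", "", lines[is_header])
  # Group the sequence lines by record; empty records give an empty string.
  groups <- split(
    lines[!is_header],
    factor(record_id[!is_header], levels = seq_along(headers))
  )
  sequences <- vapply(groups, paste, character(1), collapse = "")
  data.frame(Header = headers, Sequence = unname(sequences),
             stringsAsFactors = FALSE)
}

# Hypothetical toy file, written to a temporary location for the demo.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(
  ">toy_protein_1 hypothetical example",
  "QQPQQPFP",
  "QQPQQLQQ",
  ">toy_protein_2 hypothetical example",
  "MASAELSR"
), fasta_path)

db <- read_fasta_simple(fasta_path)
print(db)
```

The resulting data frame has the same shape as Table 1 (one identifier column and one sequence column), so it can go straight into a tidyverse pipeline.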

Favourite tool

My favourite package is tidyverse. Learning just a few functions in the first days of Data School already made my daily work much easier. It was love at second sight. I was able to clean and tidy my data. The functions I use the most are mutate, join and, of course, the pipe %>%. Another favourite part of working with R is using regex; learning it was very useful for writing scripts.
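As a small illustration of the kind of regex that helps with protein identifiers, the sketch below pulls the text after the underscore out of the accessions shown in Table 1. Treating that suffix as a species code is an assumption based on the UniProt-style names, and base R `sub()` and `grepl()` are used here in place of stringr so the example runs without extra packages.

```r
# Accessions copied as they appear in Table 1.
accessions <- c(
  "spP4910614331_MAIZE",
  "spQ84Q72HS181_ORYSJ",
  "spP69555PSBH_WHEAT",
  "spP36886PSAK_HORVU",
  "spQ6YZE2GSA_ORYSJ"
)

# Capture the letters or digits after the last underscore.
# Assumption: this suffix is a species code.
species <- sub("^.*_([A-Z0-9]+)$", "\\1", accessions)

# Count entries per suffix, and select the accessions ending in "_ORYSJ".
species_counts <- table(species)
orysj_entries <- accessions[grepl("_ORYSJ$", accessions)]

print(species_counts)
print(orysj_entries)
```

The same pattern should also work inside `mutate()` with `stringr::str_replace()` when the accessions sit in a data frame column.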

My time went …

tidying protein data manually.


Next steps

I will keep working with R; I think there is a lot I haven't tried yet. I want to practise writing functions more. In the future I would like to create a script that identifies non-gluten proteins that are similar to gluten proteins or that generate an immune reaction in patients. For that I will have to learn more about R, such as working with APIs.

And maybe some day I will start with Python…
